First, we start by loading the packages and data we’ll need.
# add packages here. I'd suggest starting with tidyverse, but there are others you
# may want to add as well
# library(package)
library(psych)
library(plyr)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::arrange() masks plyr::arrange()
## ✖ purrr::compact() masks plyr::compact()
## ✖ dplyr::count() masks plyr::count()
## ✖ dplyr::failwith() masks plyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::id() masks plyr::id()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::mutate() masks plyr::mutate()
## ✖ dplyr::rename() masks plyr::rename()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
library(broom)
# first, let's set the path to the data below. This will be the path to the folder on
# your computer. On a Mac, your home directory can be accessed by including "~/" in your
# file path
# example wd <- "~/Dropbox/Summer 2018"
# add your file path below
wd <- "https://github.com/emoriebeck/R-tutorials/blob/master/RA_Files/Week_2_Correlation/data_cleaning"
# now let's load the data. I like to use something like the line below, which uses
# the `sprintf()` function to merge the file path with the name of the file
# file <- sprintf("%s/data/target_w1.csv", wd)
# pairsW1 <- read.csv(file, stringsAsFactors = F)
# do this below
# either paste() or sprintf() works to build the URL; sprintf() is used here
file <- url(sprintf("%s/home_w1_redacted.csv?raw=true", wd))
pairsW1 <- file %>% readr::read_csv()
## Parsed with column specification:
## cols(
## .default = col_integer(),
## ts.RS03.w1 = col_double(),
## ts.RS04.w1 = col_double(),
## ts.RS05.w1 = col_double(),
## ts.RS06.w1 = col_double(),
## ts.i1.NOM09.w1 = col_double(),
## ts.i1.NOM11.w1 = col_double(),
## ts.i2.NOM06.w1 = col_double(),
## ts.i2.NOM07.w1 = col_double(),
## ts.i3.NOM08.w1 = col_double(),
## ts.i7.NOM06.w1 = col_double(),
## ts.AGQ03.w1 = col_double(),
## ts.DD.NQ09.2.w1 = col_double(),
## ts.DD.NQ09.4.w1 = col_double(),
## ts.CRT.ss.w1 = col_double()
## )
## See spec(...) for full column specifications.
pairsW1
Now that the data are loaded, we usually need to make modifications: keeping some columns, removing others, creating composites, applying exclusion criteria, etc.
To do this, it’s usually smart to look at some descriptives of your data, especially if you aren’t very familiar with it.
One useful function for this is the describe() function in the psych package.
# try using the describe function in the chunk below:
# describe(pairsW1)
# are there any variables that stand out as weird?
describe(pairsW1)
When I work with data, I’m usually pulling a few variables from data sets that may contain thousands of variables. I often find the easiest way to do this is to use the codebooks associated with the study to create an Excel sheet that lists a number of properties, including the original column name of the item in the data, a new name that I’d like to give it, the purpose of the variable (e.g. outcome variable, predictor variable, covariate, etc.), whether I need to reverse score it, the scale participants responded on, and what values represent missing values (this will vary by data set). See the example I’ve provided and create your own codebook of variables you find interesting.
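As a sketch, here is roughly what a couple of rows of such a codebook might look like once loaded into R. The items themselves are hypothetical, but the column names (old_name, new_name, rev_code, mini, maxi, scale) match how the codebook is used later in this tutorial:

```r
# a minimal, hypothetical sketch of a codebook; the item names are made up,
# but the columns match those referenced in the code below
codebook_sketch <- tibble::tribble(
  ~old_name, ~new_name,  ~purpose,    ~rev_code, ~mini, ~maxi, ~scale,
  "ts.BF01", "BFI.E_1",  "predictor",  1,         1,     15,   "yes",
  "ts.BF06", "BFI.E_2",  "predictor", -1,         1,     15,   "yes"
)
```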
# load the codebook
# codebook <- sprintf("%s/codebook.xlsx", wd)
# codebook <- readxl::read_xlsx(codebook)
destfile <- "Codebook.xlsx"
curl::curl_download(sprintf("%s/Codebook.xlsx?raw=true", wd), destfile)
codebook <- readxl::read_excel(destfile)
# Now we need to find a way to match the variables in the codebook with the variables in
# your data. I'm not going to give you explicit instructions. Struggle a bit, and try
# different ways. I'll give you a couple of hints below
# old_names <- codebook$old_names
# new_names <- codebook$new_names
old_names <- codebook$old_name
new_names <- codebook$new_name
# somehow, you need to subset your data so only the variables in your codebook are left
# in your data frame. Refer to the basics of subsetting data in R (e.g. df[,cols]) and
# remember what you learned in the intro to tidyverse data camp course
# base R way
pairsW1.base <- pairsW1[,paste(old_names, "w1", sep = ".")]
colnames(pairsW1.base) <- new_names
# dplyr way
pairsW1.tidy <- pairsW1 %>%
select(one_of(paste(old_names, "w1", sep = "."))) %>%
setNames(new_names)
# then you need to rename those variables as you've specified in your codebook
# functions like `colnames()` and `setNames()` might be useful here
We want to make sure R recognizes missing values as missing. Below, you’re going to need to make sure that missing values are specified as NA, not as another value.
# missings are coded as -7 in the data. Change these to NA values.
# base R
pairsW1.base <- apply(pairsW1.base, 2, function(x) ifelse(x == -7, NA_real_, x))
pairsW1.base <- data.frame(pairsW1.base)
# psych package
pairsW1.psych <- scrub(pairsW1.tidy, new_names, isvalue = -7, newvalue = NA_real_)
# tidyverse
pairsW1.tidy <- pairsW1.tidy %>% mutate_all(funs(mapvalues(., from = -7, to = NA_real_, warn_missing = F)))
Almost always, there are variables that are not coded how we want. It could be an ordinal or nominal scale that has levels we don’t want or that needs to be reverse coded. It could be that we want to collapse variables. Whatever it is, it’s important to do.
# recode a nominal variable with 3+ levels into a nominal variable with 2 levels
# try the `recode()` function and also refer to base code techniques
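For example, here is a minimal sketch of collapsing a 3-level nominal variable into 2 levels with dplyr's recode(). The variable name and level codes are made up for illustration:

```r
# hypothetical example: collapse a 3-level variable (1, 2, 3) into 2 levels
# using dplyr::recode(); the variable name and labels are not from this data set
df <- tibble::tibble(marital = c(1, 2, 3, 1, 2))
df <- df %>%
  mutate(marital_2lev = recode(marital, `1` = "partnered", `2` = "partnered", `3` = "single"))
```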
# now reverse code items that need to be reverse coded
# hint: you can create a vector of keys from your codebook
# hint: you could also try moving your data from wide to long format then using full_join to merge your codebook with your data
# then you can use the psych package function `reverse.code()` to reverse code your data
# psych package
keys <- codebook$rev_code[codebook$rev_code == -1]
items <- codebook$new_name[codebook$rev_code == -1]
pairsW1.psych[,items] <- reverse.code(keys, pairsW1.psych[,items], mini = 1, maxi = 15)
# base R -- don't go there
# tidyverse
items <- (codebook %>% filter(rev_code == -1))$new_name
pairsW1.tidy <- pairsW1.tidy %>%
gather(key = new_name, value = value, one_of(items), na.rm = T) %>%
left_join(codebook %>% select(new_name, rev_code, mini, maxi)) %>%
mutate(value = ifelse(rev_code == -1, reverse.code(-1, value, mini, maxi), value)) %>%
spread(key = new_name, value = value)
## Joining, by = "new_name"
Usually, there are variables that we need to composite. It could be a personality scale, it could be the same item answered over time. Whatever it is, you’ll do it a lot. There are lots of ways to go about it. Below, you should try to do this a couple of different ways and see if you achieve the same results.
# first, try the dplyr way: get your data in "tidy" format, use the
# `group_by()` function to group by person, scale, etc., then use the
# `summarize()` function to create the composites
items <- (codebook %>% filter(scale == "yes"))$new_name
pairsW1.tidy <- pairsW1.tidy %>%
gather(key = item, value = value, one_of(items), na.rm = T) %>%
separate(item, c("scale", "item"), sep = "_") %>%
group_by(SID, scale) %>%
summarize(value = mean(value, na.rm = T)) %>%
spread(key = scale, value = value) %>%
full_join(pairsW1.tidy) %>%
ungroup()
## Joining, by = "SID"
# Now, try the base R way.
# hint: try using the function `rowMeans()`. You can select multiple columns using
# the `cbind()` function within `rowMeans()`. Don't forget to set na.rm = T
# personality
items <- codebook$new_name[grepl("BFI.E", codebook$new_name)]
pairsW1.base$BFI.E <- rowMeans(pairsW1.base[,items], na.rm = T)
items <- codebook$new_name[grepl("BFI.A", codebook$new_name)]
pairsW1.base$BFI.A <- rowMeans(pairsW1.base[,items], na.rm = T)
items <- codebook$new_name[grepl("BFI.C", codebook$new_name)]
pairsW1.base$BFI.C <- rowMeans(pairsW1.base[,items], na.rm = T)
items <- codebook$new_name[grepl("BFI.N", codebook$new_name)]
pairsW1.base$BFI.N <- rowMeans(pairsW1.base[,items], na.rm = T)
items <- codebook$new_name[grepl("BFI.O", codebook$new_name)]
pairsW1.base$BFI.O <- rowMeans(pairsW1.base[,items], na.rm = T)
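The five blocks above can also be collapsed into a loop over the trait prefixes, which scales better if you add more scales. This is a sketch with the same behavior as the repeated code above:

```r
# same as the repeated blocks above, but looped over the five Big Five prefixes
for (trait in c("BFI.E", "BFI.A", "BFI.C", "BFI.N", "BFI.O")) {
  items <- codebook$new_name[grepl(trait, codebook$new_name)]
  pairsW1.base[[trait]] <- rowMeans(pairsW1.base[, items], na.rm = T)
}
```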
# life satisfaction
items <- codebook$new_name[grepl("^Sat", codebook$new_name)]
pairsW1.base$lifesat <- rowMeans(pairsW1.base[,items], na.rm = T)
round(cor(pairsW1.tidy %>% select(BFI.A:Sat), use = "pairwise"),2)
## BFI.A BFI.C BFI.E BFI.N BFI.O Sat
## BFI.A 1.00 0.21 0.13 -0.34 0.07 0.31
## BFI.C 0.21 1.00 0.09 -0.15 -0.06 0.27
## BFI.E 0.13 0.09 1.00 -0.27 0.14 0.27
## BFI.N -0.34 -0.15 -0.27 1.00 -0.08 -0.41
## BFI.O 0.07 -0.06 0.14 -0.08 1.00 0.04
## Sat 0.31 0.27 0.27 -0.41 0.04 1.00
print(corr.test(pairsW1.tidy %>% select(BFI.A:Sat)), short = F)
## Call:corr.test(x = pairsW1.tidy %>% select(BFI.A:Sat))
## Correlation matrix
## BFI.A BFI.C BFI.E BFI.N BFI.O Sat
## BFI.A 1.00 0.21 0.13 -0.34 0.07 0.31
## BFI.C 0.21 1.00 0.09 -0.15 -0.06 0.27
## BFI.E 0.13 0.09 1.00 -0.27 0.14 0.27
## BFI.N -0.34 -0.15 -0.27 1.00 -0.08 -0.41
## BFI.O 0.07 -0.06 0.14 -0.08 1.00 0.04
## Sat 0.31 0.27 0.27 -0.41 0.04 1.00
## Sample Size
## [1] 414
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## BFI.A BFI.C BFI.E BFI.N BFI.O Sat
## BFI.A 0.00 0.00 0.04 0.00 0.44 0.00
## BFI.C 0.00 0.00 0.35 0.02 0.44 0.00
## BFI.E 0.01 0.07 0.00 0.00 0.03 0.00
## BFI.N 0.00 0.00 0.00 0.00 0.44 0.00
## BFI.O 0.15 0.20 0.00 0.11 0.00 0.44
## Sat 0.00 0.00 0.00 0.00 0.43 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
##
## Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
## raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## BFI.A-BFI.C 0.12 0.21 0.30 0.00 0.08 0.34
## BFI.A-BFI.E 0.04 0.13 0.23 0.01 0.01 0.26
## BFI.A-BFI.N -0.43 -0.34 -0.25 0.00 -0.46 -0.21
## BFI.A-BFI.O -0.02 0.07 0.17 0.15 -0.05 0.19
## BFI.A-Sat 0.22 0.31 0.40 0.00 0.18 0.44
## BFI.C-BFI.E -0.01 0.09 0.18 0.07 -0.04 0.21
## BFI.C-BFI.N -0.24 -0.15 -0.05 0.00 -0.28 -0.01
## BFI.C-BFI.O -0.16 -0.06 0.03 0.20 -0.17 0.05
## BFI.C-Sat 0.18 0.27 0.36 0.00 0.14 0.40
## BFI.E-BFI.N -0.36 -0.27 -0.18 0.00 -0.40 -0.14
## BFI.E-BFI.O 0.04 0.14 0.23 0.00 0.01 0.27
## BFI.E-Sat 0.18 0.27 0.35 0.00 0.13 0.39
## BFI.N-BFI.O -0.17 -0.08 0.02 0.11 -0.20 0.04
## BFI.N-Sat -0.49 -0.41 -0.33 0.00 -0.52 -0.28
## BFI.O-Sat -0.06 0.04 0.13 0.43 -0.06 0.13
# Does Extraversion Predict Life Satisfaction
fitE <- lm(Sat ~ BFI.E, data = pairsW1.tidy)
There are lots of helper functions for regression, like:
summary(), which prints a summary of the results
summary(fitE)
##
## Call:
## lm(formula = Sat ~ BFI.E, data = pairsW1.tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2532 -1.2982 0.1209 1.3994 4.9508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.62125 0.32608 26.439 < 2e-16 ***
## BFI.E 0.19199 0.03406 5.637 3.21e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.989 on 412 degrees of freedom
## Multiple R-squared: 0.07161, Adjusted R-squared: 0.06935
## F-statistic: 31.78 on 1 and 412 DF, p-value: 3.212e-08
coef(), which prints the coefficients of the model
coef(fitE)
## (Intercept)       BFI.E
## 8.6212501 0.1919857
residuals(), which prints the residuals of the model
head(residuals(fitE))
##           1           2           3           4           5           6
## -0.04110684 -1.60508713 2.41481435 0.44286450 -0.79707280 2.24682689
predict(), which generates predicted values for all the observations of X, using the model
head(predict(fitE))
##         1         2         3         4         5         6
## 10.541107 10.805087 9.485186 10.157135 10.997073 9.653173
The broom package offers a solution to the sometimes awkward S3 class structure of lm objects in R.
It has great functions like:
tidy(), which prints a data frame of the coefficients, with the standard errors, t-test statistics associated with the estimates, and p values.
tidy(fitE)
glance(), which summarizes model fit indices, like \(R^2\)
glance(fitE)
augment(), which adds columns with the predictions, residuals, etc. for each data observation
augment(fitE)
Run Cronbach’s Alpha - this is actually a really reliable scale to check. We just need something demonstrative.
psych::alpha(pairsW1.tidy %>% select(contains("BFI.E_")))
##
## Reliability analysis
## Call: psych::alpha(x = pairsW1.tidy %>% select(contains("BFI.E_")))
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd
## 0.9 0.9 0.91 0.53 9.1 0.0074 9.1 2.9
##
## lower alpha upper 95% confidence boundaries
## 0.89 0.9 0.91
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se
## BFI.E_1 0.88 0.88 0.88 0.51 7.3 0.0090
## BFI.E_3 0.88 0.88 0.89 0.52 7.5 0.0088
## BFI.E_4 0.89 0.89 0.89 0.54 8.2 0.0081
## BFI.E_5 0.90 0.91 0.91 0.58 9.6 0.0071
## BFI.E_8 0.89 0.89 0.89 0.53 8.1 0.0082
## BFI.E_2 0.88 0.89 0.89 0.53 7.8 0.0087
## BFI.E_6 0.89 0.89 0.89 0.54 8.1 0.0084
## BFI.E_7 0.88 0.88 0.88 0.51 7.4 0.0092
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## BFI.E_1 413 0.83 0.84 0.82 0.78 10.2 3.6
## BFI.E_3 414 0.81 0.81 0.79 0.75 10.3 3.5
## BFI.E_4 414 0.73 0.75 0.71 0.65 10.3 3.4
## BFI.E_5 414 0.61 0.61 0.51 0.49 9.0 3.9
## BFI.E_8 414 0.74 0.76 0.72 0.67 10.1 3.3
## BFI.E_2 414 0.80 0.79 0.76 0.72 8.1 4.0
## BFI.E_6 413 0.77 0.76 0.72 0.68 6.6 4.0
## BFI.E_7 414 0.85 0.84 0.82 0.78 8.5 4.3
a_data <- broom::augment(fitE)
ggplot(a_data, aes(.fitted, .resid)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_bw()
ggplot(a_data, aes(x = BFI.E, y = .resid)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_classic()
ggplot(a_data, aes(x = .resid)) +
geom_density() +
theme_classic()
## Q-Q plot
# dark line is the theoretical normal distribution
# data points for those are close to theoretical
# dotted lines are the confidence bands around it
# want the data points within the confidence bands
# if there are just a couple outside of this, it isn't a huge deal
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:psych':
##
## logit
qqPlot(fitE)